An associative control process (ACP) network is a learning control system that
can reproduce a variety of animal learning results from classical and
instrumental conditioning experiments (Klopf, Morgan, & Weaver, 1993; see
also the article, "A Hierarchical Network of Control Systems that Learn"). The
ACP networks proposed and tested by Klopf, Morgan, and Weaver are not
guaranteed, however, to learn optimal policies for maximizing reinforcement.
Optimal behavior is guaranteed for a reinforcement learning system such as
Q-learning (Watkins, 1989), but simple Q-learning is incapable of reproducing
the animal learning results that ACP networks reproduce. We propose two new
models that reproduce the animal learning results and are provably optimal.

The first model, the modified ACP network, embodies the smallest number of
changes necessary to the ACP network to guarantee that optimal policies will be
learned while still reproducing the animal learning results. The second model,
the single-layer ACP network, embodies the smallest number of changes
necessary to Q-learning to guarantee that it reproduces the animal learning
results while still learning optimal policies. We also propose a hierarchical
network  architecture  within  which  several  reinforcement  learning  systems  (e.g., 
Q-learning  systems,  single-layer  ACP  networks,  or  any  other  learning 
controller)  can  be  combined  in  a  hierarchy.  We  implement  the  hierarchical 
network  architecture  by  combining  four  of  the  single-layer  ACP  networks  to 
form  a  controller  for  a  standard  inverted  pendulum  dynamic  control  problem. 
The  hierarchical  controller  is  shown  to  learn  more  reliably  and  more  than  an 
order  of  magnitude  faster  than  either  the  single-layer  ACP  network  or  the  Barto, 
Sutton,  and  Anderson  (1983)  learning  controller  for  the  benchmark  problem. 

Key Words: optimal control, learning, Q-learning, hierarchical control

1  Introduction 

An  associative  control  process  (ACP)  network  is  a  learning  control  system  that  can 
reproduce a variety of animal learning results from classical and instrumental
conditioning experiments (Klopf, Morgan, & Weaver, 1993; see also the article, “A
Hierarchical Network of Control Systems that Learn”). The ACP networks proposed
and tested by Klopf, Morgan, and Weaver are not guaranteed, however, to
learn  optimal  policies  for  maximizing  reinforcement.  These  ACP  networks  require 
that  training  be  conducted  in  trials  and  that  reinforcement  occur  only  at  the  end  of  a 
trial.  These  ACP  networks  cannot  handle  multiple  reinforcements  in  the  same  trial 
or  reinforcement  before  the  end  of  a  trial.  Also,  given  a  choice  of  several  different 
routes  to  reinforcement,  these  ACP  networks  are  not  guaranteed  to  find  the  shortest 
route.  Optimal  behavior  in  these  situations  is  guaranteed  for  a  reinforcement  learning 
system such as Q-learning (Watkins, 1989), but simple Q-learning cannot reproduce
the  animal  learning  results  that  ACP  networks  reproduce. 

We  propose  two  new  models  that  reproduce  the  animal  learning  results  and  are  also 
provably  optimal.  The  first  model,  the  modified  ACP  network,  embodies  the  smallest 
number of changes necessary to the ACP network to guarantee that optimal policies
will  be  learned  while  still  reproducing  the  animal  learning  results.  The  second  model, 
the  single-layer  ACP  network,  embodies  the  smallest  number  of  changes  necessary  to 
Q-learning  to  guarantee  that  it  reproduces  the  animal  learning  results  while  still 
learning  optimal  policies.  The  two  models  have  identical  behavior  but  different 
internal  structure,  and  both  are  presented  in  order  to  illustrate  how  they  differ  from 
the  original  ACP  network  and  from  Q-learning. 

The  modified  ACP  network  and  the  single-layer  ACP  network  are  guaranteed 
to  learn  optimal  behavior  eventually  but,  like  Q-learning,  may  learn  very  slowly. 
We  propose  a  hierarchical  network  architecture  as  one  approach  for  increasing  the  speed 
of  learning.  This  is  an  architecture  within  which  several  reinforcement  learning 
systems  (e.g.,  Q-learning  systems,  single-layer  ACP  networks,  or  any  other  learning 
controller)  can  be  combined  in  a  hierarchy.  We  implement  the  hierarchical  network 
architecture  by  combining  four  of  the  single-layer  ACP  networks  to  form  a  controller 
for  a  benchmark  cart-pole  problem.  Using  this  standard  problem,  we  compare  the 
hierarchical  learning  controller’s  performance  with  that  of  other  learning  systems, 
including  that  of  the  Barto,  Sutton,  and  Anderson  (1983)  learning  controller.  We 
will  demonstrate  that  the  hierarchical  network  learns  more  reliably  and  the  training 
time  is  decreased  by  more  than  one  order  of  magnitude  for  this  problem. 

2  The  ACP  Network  Architecture 

The  original  ACP  network  was  proposed  by  Klopf,  Morgan,  and  Weaver  (1993) 
and  incorporates  the  drive-reinforcement  learning  mechanism  described  in  Klopf 
(1988).  This  section  describes  the  original  ACP  network,  which  is  then  modified 
and simplified in a subsequent section.

Figure 1

[Diagram labels, left to right: sensor inputs, motor centers, reinforcement centers, reinforcement inputs]

Associative control process (ACP) network architecture. Assuming N = 0 at all times, there is only one
reinforcement center that is active, the positive reinforcement center (PC). The negative reinforcement
center (NC) is never active, and the weights associated with NC never change. The state is sensed
through the sensors, and the actions are performed based on the outputs of the motor centers. There is
one motor center for each possible action. For each pair consisting of one sensor and one motor center,
there are four weights: an excitatory and an inhibitory weight for the connections from the sensor to the
motor center and an excitatory and an inhibitory weight for the connections from the sensor to PC. The
latter connections are facilitated by signals from the motor center.

An ACP network (Fig. 1) has two types
of inputs: a pair of reinforcement signals (rectangles on the right) and m sensors
(rectangles  on  the  left).  There  are  two  layers:  a  layer  consisting  of  n  motor  centers 
(circles  on  the  left)  and  a  layer  consisting  of  a  pair  of  reinforcement  centers  (circles  on 
the  right).  The  positive  reinforcement  center  (PC)  learns  to  predict  the  occurrence  of 
positive  reinforcement  (P).  If  the  signal  N  is  zero  at  all  times,  then  the  reinforcement 
center NC has no effect on either behavior or learning. The modified ACP network
does not have a negative reinforcement center yet is able to reproduce the simulation
results of the original ACP network. There are two weights from a given sensor i
to a given motor center j: a positive weight $W_{ij+}$ and a negative weight $W_{ij-}$. For
each motor center j, there are two weights from sensor i to the positive reinforcement
center that are facilitated by that motor center: $W_{0ij+}$ and $W_{0ij-}$. Signals pass through
these  facilitated  connections  only  when  the  associated  motor  center  is  active.  In  the 
notation used here, $y_j$ (with a subscript) represents the output of one of the motor
centers.

Adaptive Behavior Volume 1, Number 3

An Optimal Learning Control System

Leemon C. Baird III & A. Harry Klopf

The variable $y$ without a subscript represents the output of the reinforcement
center  PC.  Equations  1  through  4  specify  the  calculation  of  the  outputs  of  the  various 
centers: 


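$$y_j(t) = f\left[\sum_{i=1}^{m} \left(W_{ij+}(t) + W_{ij-}(t)\right) x_i(t)\right] \qquad (1)$$

$$y(t) = f\left[P(t) + \sum_{i=1}^{m} \left(W_{0ij_{max}+}(t) + W_{0ij_{max}-}(t)\right) x_i(t)\right] \qquad (2)$$

$$f(x) = \begin{cases} 0 & \text{if } x < \theta \\ 1 & \text{if } x > 1 \\ x & \text{otherwise} \end{cases} \qquad (3)$$

$$j_{max} = j \ \text{such that} \ \forall k \neq j, \; y_j(t) > y_k(t) \qquad (4)$$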


The threshold, $\theta$, is a small positive constant. When one motor center has an
output larger than all the others, the index $j_{max}$ represents which motor center output
is largest. This, in turn, determines which weights are used to calculate the output
of the reinforcement center. If two or more motor center outputs are equal to the
maximum output, then $j_{max}$ is undefined for that time step, all motor center and
reinforcement center outputs go to zero, the network as a whole performs no action,
and no weights change on that time step. Otherwise, the action associated with
motor center $j_{max}$ is performed at time t, and only those weights associated with
action $j_{max}$ change. The changes in those weights are:

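$$\Delta W_{ij_{max}\pm}(t) = \left(c_a + c_b \, |y(t)|\right) \left|W_{ij_{max}\pm}(t)\right| \left[\Delta x_i(t)\right]^{+} \left[y(t) - y_{j_{max}}(t)\right] \qquad (5)$$

$$\Delta W_{0ij_{max}\pm}(t) = \Delta y(t) \sum_{k=1}^{\tau} c_k \left|W_{0ij_{max}\pm}(t-k)\right| \left[\Delta x_i(t-k)\right]^{+} \qquad (6)$$

$$\Delta y(t) = y(t) - y(t-1) \qquad (7)$$

$$\left[\Delta x_i(t)\right]^{+} = \begin{cases} 1 & \text{if } x_i(t) = 1 \text{ and } x_i(t-1) = 0 \\ 0 & \text{otherwise} \end{cases} \qquad (8)$$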

The factors $c_a$, $c_b$, $\tau$, and $c_k$ are all nonnegative constants. The learning
process is divided into periods of time called trials, and weights change only at the
end of each trial. The weight change at the end of the trial is simply the sum of all
$\Delta W$ calculated during the trial.
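
As a minimal sketch, the output function of Equation 3 and the winner selection of Equation 4 might be written in Python as follows. The function names and the threshold value 0.01 are illustrative assumptions; the article states only that the threshold is a small positive constant.

```python
from typing import Optional, Sequence


def f(x: float, theta: float = 0.01) -> float:
    """Output function of Equation 3 (theta = 0.01 is an assumed value)."""
    if x < theta:
        return 0.0      # below threshold: the center is silent
    if x > 1.0:
        return 1.0      # outputs are clipped at one
    return x


def select_j_max(motor_outputs: Sequence[float]) -> Optional[int]:
    """Equation 4: index of the strictly largest motor center output.

    Returns None when two or more outputs share the maximum, the case in
    which the network performs no action and no weights change.
    """
    best = max(motor_outputs)
    winners = [j for j, y in enumerate(motor_outputs) if y == best]
    return winners[0] if len(winners) == 1 else None
```

For example, `select_j_max([0.2, 0.7, 0.1])` selects the second motor center, whereas `select_j_max([0.5, 0.5, 0.1])` reports a tie.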

This ACP network, as described by Klopf, Morgan, and Weaver (1993), was
shown to be capable of reproducing a wide range of classical conditioning results, as
described in Klopf (1988), and instrumental conditioning results in a variety of
configurations of multiple-T mazes. The classical conditioning phenomena reproduced
by the ACP network include delay and trace conditioning, conditioned and
unconditioned stimulus duration and amplitude effects, partial reinforcement effects,
interstimulus interval effects, second-order conditioning, conditioned inhibition,
extinction, reacquisition effects, backward conditioning, blocking, overshadowing,
compound conditioning, and discriminative stimulus effects. The instrumental
conditioning results reproduced by the ACP network include chaining of responses,
habituation, and reactions to positive and negative reinforcement within a variety of
configurations of multiple-T mazes containing visual, tactile, and reinforcing stimuli.
Not all of these results are reproducible with a standard Q-learning system. For
example, Figure 2 shows the behavior of a Q-learning system in a simple environment.
The  environment  consists  of  a  single  state,  a  single  action,  a  constant  reinforcement, 
and  trials  consisting  of  a  single  action  followed  by  a  reinforcement.  During  learning, 
the  Q  value  becomes  equal  to  the  reinforcement.  If  the  environment  changes  so 
that  no  reinforcement  follows  the  action,  then  the  Q  value  extinguishes  to  zero.  If 
the  environment  changes  back  so  that  the  reinforcement  always  follows  the  action, 
then  the  Q  value  becomes  equal  to  the  reinforcement  once  again.  For  a  constant 
learning  rate,  both  acquisition  and  reacquisition  require  the  same  amount  of  time. 
For a decreasing learning rate, reacquisition would be slightly slower. For the ACP
network,  reacquisition  is  faster,  consistent  with  animal  learning  experimental  results. 
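
The Q-learning behavior described above can be illustrated with a short simulation. This is only a sketch under stated assumptions: the learning rate of 0.1 and the convergence tolerance of 0.01 are illustrative choices, not the settings used for Figure 2, and the function name is hypothetical.

```python
def trials_to_reach(q: float, target: float, alpha: float, tol: float = 0.01):
    """Run one-step trials with a constant reinforcement equal to `target`.

    With a single state and one-step trials there is no successor state,
    so each update is q <- (1 - alpha) * q + alpha * reinforcement.
    Returns the number of trials needed to come within `tol` of the
    target, together with the final Q value.
    """
    trials = 0
    while abs(q - target) > tol:
        q = (1.0 - alpha) * q + alpha * target
        trials += 1
    return trials, q


alpha = 0.1                                          # assumed constant learning rate
n_acquire, q = trials_to_reach(0.0, 1.0, alpha)      # acquisition
n_extinguish, q = trials_to_reach(q, 0.0, alpha)     # extinction
n_reacquire, q = trials_to_reach(q, 1.0, alpha)      # reacquisition
print(n_acquire, n_extinguish, n_reacquire)
```

With these settings the acquisition and reacquisition counts coincide, which is the property that distinguishes simple Q-learning from the faster reacquisition seen in animals and in the ACP network.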

3  Definition  of  Optimality 

Before  it  is  possible  to  modify  the  network  for  optimality,  or  even  to  discuss  the 
optimality  of  a  control  system,  it  is  necessary  to  define  optimality.  The  performance 
of a control system may be defined in terms of a reinforcement signal, $R(t)$, which
is received from the environment on each time step based on the controller’s actions.
A controller that acts so as to receive high values of $R(t)$ is better than a controller
that acts so as to receive low values of $R(t)$. An optimal controller can be defined
as  a  controller  that  chooses  actions  that  maximize  V,  the  expected  value  of  the  total 
discounted reinforcement:

$$V = E\left(\sum_{t=0}^{\infty} \gamma^{t} R(t)\right) \qquad (9)$$

This  is  a  standard  definition  in  Markov  decision  process  theory,  reinforcement 
learning  theory,  and  control  theory.  The  function  E()  represents  the  expected  value, 
and $\gamma$ is a constant between zero and one. It is assumed that at time $t$ the controller
looks at the current state of the system being controlled and chooses an action. As a
result of that action, the state changes, and the controller receives a scalar reinforcement
signal, $R(t)$. In a deterministic system, the new state and the reinforcement are
functions of the old state and the action chosen. In a stochastic system, the new state
and reinforcement are stochastically generated according to a probability distribution
that is a function of the old state and the action.

Figure 2

[Horizontal axis of both graphs: Trial]

(a) Q-learning for an environment consisting of a single state, a single action, and all trials lasting only
one time step with a constant reinforcement received for performing the action in the state. If a
reinforcement of 1.0 is always received, then the Q-learning system learns to anticipate that
reinforcement. If the environment changes so that no reinforcement is ever received, then the
expectation extinguishes. If the environment changes again so that reinforcement of 1.0 is again
received, then the association is reacquired at the same rate as in the initial acquisition. (b) In the
original ACP network, modified ACP network, and single-layer ACP network, reacquisition is more
rapid than initial acquisition, consistent with animal learning experimental results. In the first graph, the
learning rate $\alpha = 1.0$. In the second, the learning rate constants $c_k = \{0.5, 0.03, 0.15, 0.075, 0.025\}$.
In both graphs, the discount factor $\gamma = 1.0$, and the initial weights are 0.001.

This  common  definition  of  optimality  has  a  number  of  features  that  make  it 
useful  and  intuitive.  One  such  property  is  that  a  controller  which  is  optimal  by  this 
definition  can  be  thought  of  as  avoiding  punishment  or  failure  and  seeking  reward 
or  success.  For  example,  actions  that  result  in  positive  values  of  R  are  considered 
better than actions that result in values of R that are zero or, worse yet, negative.
Therefore, if an optimal controller receives a reinforcement of zero most of the
time,  then  positive  values  of  R  are  analogous  to  rewards  and  negative  values  of  R  are 
analogous  to  punishments.  The  controller  performs  actions  that  tend  to  increase  the 
probability  of  receiving  positive  R  signals  and  decrease  the  probability  of  receiving 
negative  R  signals.  Similarly,  if  the  system  receives  a  reinforcement  of  R  =  5  most 
of  the  time,  then  the  actions  performed  by  an  optimal  controller  can  be  interpreted 
as  treating  R  =  4  as  a  punishment  and  R  =  6  as  a  reward.  When  the  reinforcement 
signals  are  interpreted  in  this  manner,  this  definition  of  optimality  yields  intuitively 
reasonable  behavior. 

If  several  different  actions  lead  to  the  same  reward,  it  is  natural  to  define  the  optimal 
action  as  that  which  leads  to  the  earliest  reward.  Conversely,  if  several  actions  would 
all  lead  to  equal  punishments,  it  is  natural  to  define  the  optimal  action  as  that  which 
delays punishment for as long as possible. When $\gamma$ is between 0 and 1 exclusive,
Equation 9 defines optimality in this way. The exact value of $\gamma$ determines the extent
to  which  immediate  reinforcement  is  more  important  than  delayed  reinforcement. 
This  value  must  be  chosen  a  priori;  it  cannot  be  learned  or  calculated.  For  example, 
before  a  business  can  be  advised  on  an  optimal  course  of  action,  it  is  necessary  to 
determine  the  goals.  To  what  extent  is  slow  growth  in  the  next  year  acceptable  in 
order  to  allow  greater  growth  over  the  next  5  years?  Is  the  primary  goal  short-term 
profits  or  long-term  growth?  As  another  example,  before  a  control  engineer  can 
design  a  flight  control  system,  it  is  necessary  to  know  the  preferences  of  the  pilots. 
Is  it  preferable  to  decrease  errors  quickly,  yielding  responsive  but  possibly  oscillatory 
controls,  or  is  it  preferable  to  decrease  errors  more  slowly,  yielding  smoother  but 
sluggish controls? There is no “optimal” answer to these questions; the answers
depend on the preferences of those involved. Therefore, there is no “best” value for
$\gamma$; it should be chosen to reflect preferences. Lower values of $\gamma$ give higher priority
to the immediate future, whereas higher values of $\gamma$ give more nearly equal priorities.
If $\gamma = 1$, then reinforcement signals at all points in time are equally important and
are  given  equal  weight. 
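
The effect of the discount factor on the timing of rewards and punishments can be checked numerically. The following sketch assumes a finite reward sequence indexed from t = 0; the function name and the example sequences are illustrative and do not come from the article.

```python
from typing import Sequence


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """Sum of gamma**t * R(t) over a finite reward sequence, t = 0, 1, ..."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))


gamma = 0.9                                               # assumed discount factor
early_reward = discounted_return([0, 1, 0, 0], gamma)     # reward at t = 1
late_reward = discounted_return([0, 0, 0, 1], gamma)      # same reward at t = 3
early_penalty = discounted_return([0, -1, 0, 0], gamma)   # punishment at t = 1
late_penalty = discounted_return([0, 0, 0, -1], gamma)    # same punishment at t = 3
```

With a discount factor strictly between 0 and 1, the earlier reward has the larger value and the later punishment has the larger (less negative) value. With a discount factor of 1, the timing makes no difference.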

Figure 3

The problem of competing goals. The system starts at C and receives reinforcement $R_A$ when it first
reaches A and $R_B$ when it first reaches B. It is assumed that $R_A > R_B > 0$. When starting near the
center (top), the optimal path goes to A first, then B. When starting near B (bottom), the optimal path
goes to B first, then A. The exact values of $R_A$, $R_B$, and $\gamma$ determine how close to B the system must
start to have the optimal path reach B first. This appears to be a reasonable definition of optimal in the
presence of competing goals.

Equation 9 is also a reasonable definition of optimality in the presence of multiple,
competing goals. For example, Figure 3 illustrates the behavior of a system that starts
at point C and can move along a line with constant speed. The reinforcement $R(t)$
is zero at all times except for the first time the system reaches points A and B, at
which time it receives reinforcement $R_A$ and $R_B$, respectively. If both reinforcement
signals are greater than zero, then points A and B are goals. They are conflicting or
competing goals, since movement toward one goal is movement away from the other
goal. If $R_A > R_B$ and C is near the center between A and B, then the optimal policy
is to move first to point A, then to point B. This action achieves the most important
goal first. On the other hand, if C starts close enough to B, and if $R_A$ is only slightly
greater than $R_B$, then the optimal policy is to move to B first, then A. It is optimal
to achieve the lesser goal first in this case because it reaches B much sooner, while
only slightly delaying arrival at A. The values of $R_A$, $R_B$, and $\gamma$ determine exactly
how close the starting point must be to B in order for the optimal path to proceed
through B first. This definition of optimality seems to be in agreement with what
most people would consider to be the optimal paths for this problem of competing
goals.
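
The competing-goals example can be worked through numerically. The sketch below makes several assumptions that are not in the article: A sits at position 0 and B at position `length` on an integer line, the system moves one unit per time step, a reward collected after d steps is weighted by the discount factor raised to the power d, and only the two direct routes (A then B, or B then A) are compared.

```python
def path_values(c: int, length: int, r_a: float, r_b: float, gamma: float):
    """Discounted value of visiting A then B, and of visiting B then A.

    A is at position 0, B at position `length`, and the start C at
    position c, with 0 <= c <= length (an assumed layout).
    """
    to_a, to_b = c, length - c
    a_first = gamma ** to_a * r_a + gamma ** (to_a + length) * r_b
    b_first = gamma ** to_b * r_b + gamma ** (to_b + length) * r_a
    return a_first, b_first


def best_first_goal(c: int, length: int, r_a: float, r_b: float, gamma: float) -> str:
    """Return "A" or "B", whichever goal the better route visits first."""
    a_first, b_first = path_values(c, length, r_a, r_b, gamma)
    return "A" if a_first >= b_first else "B"


# Illustrative values: the reward at A is slightly larger than the reward at B.
from_center = best_first_goal(c=5, length=10, r_a=1.1, r_b=1.0, gamma=0.9)
near_b = best_first_goal(c=9, length=10, r_a=1.1, r_b=1.0, gamma=0.9)
```

Starting from the center, the route through A first has the higher value; starting one step away from B, the route through B first does.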

The  goal  of  maximizing  V  in  Equation  9  is  a  general  definition  of  optimality, 
capable of reflecting most intuitive aspects of optimality. It can encompass goals
involving reward, punishment, preferences for immediate versus delayed reinforcement,
and competing goals. Many different types of goals can be expressed in the
form of this definition simply by choosing an appropriate value of $\gamma$ and encoding
the value of $R(t)$ appropriately. Thus, if a learning system is guaranteed to learn
to maximize V, then it is a general problem-solving system and is capable of solving
a wide range of problems. Such systems are often referred to as reinforcement
learning systems, because they must learn on the basis of reinforcement signals alone,
without being told explicitly or exactly what actions to perform. An overview of
reinforcement learning algorithms is provided by Williams (1987).

The preceding discussion defines what is meant by an optimal policy. The following
analysis proves that the proposed learning system eventually will learn an
optimal policy for any environment if the environment is explored sufficiently. This
does  not  address  the  question  of  whether  the  optimal  policy  will  be  found  in  the 
minimum  amount  of  time  or  with  the  minimum  punishment  during  learning.  The 
time  required  for  learning  will  depend  critically  on  the  particular  exploration  strategy 
employed. Another problem is that of deciding when to accept a suboptimal policy.
It may be preferable to converge to a suboptimal policy if the optimal policy is only
slightly better and if a large amount of punishment would be incurred during the
additional exploration required to find the optimal policy. When these considerations
are taken into account, the definition of optimal and the value of $\gamma$ must still
be chosen a priori. They cannot be learned. What does change with the additional
considerations is the definition of when the reinforcement must be maximized. The
simpler problem is to maximize reinforcement during a period following an exploration
and learning period. The more difficult problem is to maximize reinforcement
over all time, even during the early stages of learning. In both cases, the goal is to
maximize V as specified in Equation 9, but in one case $t = 0$ is defined as the time
at which learning begins, and in the other $t = 0$ is the time at which learning ends
and  the  learned  policy  is  used. 

The theory of learning automata deals with this problem of maximizing reinforcement
during learning. The majority of the work in learning automata has assumed
very simple environments, such as environments in which there is only a single state,
the reinforcement due to an action is received immediately, and the probability of
reward for each action is constant. It appears to be difficult to extend these theoretical
results to include learning systems for Markov sequential decision processes. For an
overview of learning automata theory, Narendra and Thathachar (1974, 1989) may
be consulted. Gittins (1989), who developed some of the most important theoretical
results in the field, describes them and provides an in-depth survey of learning
automata theory.

4  Q-Learning 

A  learning  controller  must  store  information  and  modify  the  stored  information 
during  the  course  of  learning.  If  the  goal  is  to  maximize  total  reinforcement,  then 
there  are  a  number  of  different  types  of  information  that  might  be  stored.  For 
example,  it  might  be  useful  to  store  a  policy,  a  specification  of  which  action  to 
perform  in  each  state.  It  might  also  be  useful  to  store  an  evaluation,  an  estimate  of 
the  maximum  total  reinforcement  that  can  be  achieved  when  starting  in  each  state. 
A model could also be stored which, for a given action performed in a given state,
would predict the state on the next time step. Watkins (1989) proposed a system
called Q-learning which stores Q values instead of policies, evaluations, or models.
In this system, a number called a Q value is stored for each action in each state.
The Q value for a given state-action pair represents an estimate of the maximum
total  reinforcement  that  can  be  achieved  by  a  sequence  of  actions  that  starts  with  the 
given  action  in  the  given  state.  Watkins  (1989)  proved  that  a  system  that  stores  only  Q 
values  can  learn  to  be  an  optimal  controller  according  to  the  definition  of  optimality 
given  in  equation  9.  This  is  guaranteed  if  the  Q  values  are  updated  according  to: 

$$Q[x(t), u(t)] \overset{\alpha}{\longleftarrow} R(t+1) + \gamma \max_{u} \{Q[x(t+1), u]\} \qquad (10)$$

The arrow with an $\alpha$ above it represents the operation of changing the Q value
so that it moves closer to the value of the expression on the right side of the arrow.
Equation 10 is equivalent to:

$$Q[x(t), u(t)] \longleftarrow (1 - \alpha) \, Q[x(t), u(t)] + \alpha \left( R(t+1) + \gamma \max_{u} \{Q[x(t+1), u]\} \right) \qquad (11)$$

The parameter $\alpha$ is a number between 0 and 1 that controls the rate of change of
the Q value. If $\alpha$ is 1, the Q value changes instantly to be equal to the right side of
Equation 10. If $\alpha$ is close to 0, the Q value is changed little by a single update.
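
A single update of the form given in Equation 11 might be sketched as follows. The dictionary-based table, the function name, and the state and action labels are assumptions made for illustration; initializing all Q values to zero is likewise an assumption.

```python
from collections import defaultdict


def q_update(q, state, action, reward, next_state, actions, alpha, gamma):
    """Move Q[state, action] toward reward + gamma * max_u Q[next_state, u]."""
    target = reward + gamma * max(q[(next_state, u)] for u in actions)
    q[(state, action)] = (1.0 - alpha) * q[(state, action)] + alpha * target
    return q[(state, action)]


q = defaultdict(float)            # every Q value starts at zero (assumed)
actions = ["left", "right"]       # hypothetical action labels
q[("s1", "right")] = 2.0          # pretend this value was learned earlier

new_value = q_update(q, "s0", "left", reward=1.0, next_state="s1",
                     actions=actions, alpha=0.5, gamma=0.9)
```

Here the target is 1.0 + 0.9 × 2.0 = 2.8, and with a learning rate of 0.5 the stored value moves halfway from 0 to 2.8, giving 1.4. Setting the learning rate to 1 would replace the stored value with the target in a single step.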

If the system is in state $x(t)$ at time $t$, and the controller performs action $u(t)$
in response to that state, then the new state is $x(t+1)$. The value $R(t+1)$ is the
reinforcement received as a result of performing action $u(t)$ in state $x(t)$. Consistent
with the notation used throughout this article, $R(t+1)$ is a function of $u(t)$ and
$x(t)$, not $u(t+1)$ or $x(t+1)$. The maximum of all the Q values in the new state
represents an estimate of the maximum achievable discounted sum of reinforcement,
starting with $R(t+2)$. Thus, the sum on the right side of Equation 10 represents an
estimate of what $Q[x(t), u(t)]$ should be. The arrow in Equation 10 represents the
act of updating the stored value for $Q[x(t), u(t)]$ so that it moves closer to the value
of the right side. If the reinforcement is stochastic, then it is useful to change the
Q value slowly and update it multiple times. This allows the Q value to converge
to the expected value of the future reinforcement. Watkins (1989) proved that if $\alpha$
for each Q value approaches zero at an appropriate rate, and if all the Q values are
updated infinitely often, then the Q values are guaranteed to converge to the correct
values. In practice, most Q-learning systems are implemented with $\alpha$ held constant
for all Q values.
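
The role of a decreasing learning rate under stochastic reinforcement can be illustrated in the simplest setting: one state, one action, and one-step trials, so that there is no successor term. The sketch assumes a learning rate of 1/n on the n-th update, which is one schedule that decays toward zero; the reward distribution and all names are hypothetical.

```python
import random


def estimate_expected_reward(rewards):
    """Update a single Q value with learning rate alpha_n = 1/n.

    Each update is Q <- (1 - alpha) * Q + alpha * R. With alpha_n = 1/n
    the Q value equals the running average of the rewards seen so far.
    """
    q = 0.0
    for n, r in enumerate(rewards, start=1):
        alpha = 1.0 / n
        q = (1.0 - alpha) * q + alpha * r
    return q


rng = random.Random(0)            # fixed seed so the run is repeatable
# Hypothetical environment: reward 1.0 with probability 0.3, otherwise 0.0.
samples = [1.0 if rng.random() < 0.3 else 0.0 for _ in range(10000)]
q_final = estimate_expected_reward(samples)
```

With this schedule the final Q value matches the sample mean of the rewards, which in turn approaches the expected reward of 0.3 as the number of updates grows. A constant learning rate would instead leave the Q value fluctuating around the expected value.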

Some of the ideas behind these algorithms were first utilized in Samuel’s checkers-playing
program (1959, 1967). Some aspects of Watkins’s (1989) Q-learning algorithm
were independently proposed by Werbos (1989) in a system called action dependent
heuristic dynamic programming (ADHDP), or back-propagated adaptive critic
(BAC). A simulation of the ADHDP system is described in Lukes, Thompson, and
Werbos (1990). The issue of finding the maximum of the stored Q values for a state
is addressed in Baird (1992). Williams and Baird (1990) discuss the convergence properties
of other dynamic programming systems for learning control. Barto, Sutton,
and Watkins (1990) relate dynamic programming to temporal difference methods,
prediction, and classical conditioning. Reinforcement learning, temporal difference
methods, and prediction are reviewed in more detail in Sutton (1988), Barto (1989),
Sutton and Barto (1981), and Dayan (1992). Other issues arising in Q-learning are
discussed in Barto and Bradtke (1991) and Sutton, Barto, and Williams (1991). Thrun
(1992) considers the issue of exploration of the environment with Q-learning systems.
Chrisman (1992) and Whitehead and Ballard (1991) consider perceptual aliasing,
the problem arising when different states yield the same sensor inputs. Sutton (1990)
considers the incorporation of a model of the environment into the learning system.
Mahadevan and Connell (1991), Lin (1991), and Singh (1992) consider more
complex, hierarchical, or modular structures of Q-learning controllers. The control
actions generated by the controllers described later are discrete and are a deterministic
function of the state, but Gullapalli (1990, 1991a, 1991b) and Millington (1991)
have considered systems where the actions can be analog and stochastic. In addition,
reinforcement learning systems have been found to work well on difficult problems.
Tesauro (1990, 1992) has applied these ideas successfully to the problem of playing
the game of backgammon, and Sofge and White (1990) applied reinforcement
learning to automated machinery for creating thermoplastic composites. Not only
is a Q-learning controller guaranteed to learn eventually, but it also appears to do so
more quickly than other reinforcement learning systems for some problems. Barto
and Singh (1990), and Lin (1992) demonstrate that a Q-learning controller learns
faster than model-based learning systems for the particular problems investigated.
Results such as these indicate that systems employing Q-learning may have significant
potential for optimal learning control. We show that, given certain values for
the parameters, a modified form of the ACP network reduces to Q-learning and is,
therefore, optimal.

5  Optimality  of  the  Modified  ACP  Network 

Consider a discrete-state, discrete-time Markov sequential decision process that is to
be controlled. At each point in time, the process is in one of m states and the controller
has a choice of n possible actions. If action i is performed while the process is in state
j, then with probability ${}^{i}P_{jk}$, on the next time step the process will be in state k and the
controller will receive a reward ${}^{i}R_{jk}$. $R_{max}$ and $R_{min}$ are defined to be the maximum
and minimum of all the reward values, respectively. Costs associated with transitions
are represented by rewards less than zero. A policy is a specification of which action
the controller should perform in each state. Given all the values of P and R, the goal
is to find the optimal policy, as defined by Equation 9. The values P and R define
a system to be controlled, and the policy defines a controller for that system. The
problem is made more difficult if only $\gamma$, $R_{max}$, and $R_{min}$ are given a priori. Then,
P and R must be discovered by the controller through the generation of actions and
the observation of results. This is the problem considered by Watkins (1989).

We now describe a modified form of ACP network that has the following properties.
Given only $\gamma$, $R_{\max}$, and $R_{\min}$, the network is guaranteed to learn the optimal
policy in every state. An ACP network has a set of binary inputs, $\{s_1, \ldots, s_m\}$, a set
of binary outputs, $\{e_1, \ldots, e_n\}$, and a real-valued input, R, representing the reward
signal. There is also an additional input, N, but this input is not needed and will be
assumed to be zero at all times. The ACP network can therefore be interfaced in a
natural manner with the Markov system described earlier. If the system is in state
i at time t, then $s_i(t) = 1$ and $s_j(t) = 0$ for $i \neq j$. At any point in time, the ACP
network has at most one nonzero output. If $e_i(t) > 0$, then action i is performed at
time t. If all outputs are zero, then the action associated with output $e_1$ is performed.
If performing action i in state j at time t causes a transition to state k at time t + 1
and yields a reward $\mathcal{R}^{i}_{jk}$, then R(t + 1) may be defined as:


$$R(t+1) = \frac{\left(\mathcal{R}^{i}_{jk} - R_{\min}\right)(1 - \gamma)}{R_{\max} - R_{\min}} \tag{12}$$


Thus, the "reward" generated by the Markov process, $\mathcal{R}^{i}_{jk}$, goes through a linear
transformation to become the "reinforcement" experienced by the learning controller,
R(t + 1). This linear transformation of the network's reinforcement input
normalizes it so that R(t + 1) always stays within the range $[0,\, 1 - \gamma]$,
and the expected total discounted reward for any policy will be in the range [0, 1].
This simplifies the selection of parameters for the network, because all signals within
the network can remain in the range [0, 1] at all times. A policy will be optimal for
this transformed problem if and only if it is optimal for the original problem, so this
normalization has no effect on the behavior of the system.
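
The normalization in equation 12 can be checked numerically. The following is a minimal sketch, not code from the original work; the function names and the sample values of $\gamma$, $R_{\min}$, and $R_{\max}$ are assumptions chosen for illustration. It maps a raw reward into $[0,\, 1 - \gamma]$ and confirms that a stream of maximal reinforcements has a discounted sum that approaches 1.

```python
def normalize_reward(reward, r_min, r_max, gamma):
    """Linearly map a raw reward in [r_min, r_max] into [0, 1 - gamma]."""
    if r_max <= r_min:
        raise ValueError("r_max must be strictly greater than r_min")
    return (reward - r_min) * (1.0 - gamma) / (r_max - r_min)


def max_discounted_return(gamma, steps=2000):
    """Discounted sum when every step pays the largest reinforcement, 1 - gamma."""
    return sum((gamma ** t) * (1.0 - gamma) for t in range(steps))


# Hypothetical example: a failure signal of -1 and a neutral reward of 0.
lowest = normalize_reward(-1.0, r_min=-1.0, r_max=0.0, gamma=0.9)
highest = normalize_reward(0.0, r_min=-1.0, r_max=0.0, gamma=0.9)
```

Because the map is increasing and linear, it preserves the ordering of policies by expected discounted return, which is why the optimal policy is unchanged.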

The  analysis  provided  next  requires  that  the  Markov  system  change  state  on  every 
time  step.  If  it  is  possible  that  the  system  being  controlled  does  not  have  this  property, 
then  the  interface  between  the  network  and  the  Markov  process  must  be  modified 
slightly.  The  number  of  inputs  to  the  network,  n,  will  have  to  be  twice  the  number 
of  states,  with  two  inputs  uniquely  associated  with  each  state.  If  the  Markov  system 
is in a given state for multiple time steps, then on the first time step in that state
one of the two associated inputs will be 1 and the other 0. On the next time step,
the values will be reversed, and they will alternate on each time step while in that
state. For each state, the same input will start with 1 every time a transition is made
into that state from another state. The learning system therefore sees a single Markov
process with n states, wherein the probability of transitioning from any state to itself
is zero for all actions.
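
The alternating-input interface just described can be sketched as a small encoder. This is an illustrative sketch only; the class name, the zero-based state numbering, and the layout of the input vector (inputs 2s and 2s + 1 belong to state s) are assumptions, not details given in the text.

```python
class AlternatingEncoder:
    """Encode each state with two inputs that alternate while the state repeats."""

    def __init__(self, num_states):
        self.num_states = num_states
        self.prev_state = None
        self.phase = 0

    def encode(self, state):
        if state == self.prev_state:
            # Same state as last step: switch to the other associated input.
            self.phase = 1 - self.phase
        else:
            # Entering from another state: always start with the first input.
            self.phase = 0
        self.prev_state = state
        x = [0] * (2 * self.num_states)
        x[2 * state + self.phase] = 1
        return x
```

With this encoding the active input changes on every time step, so the learner never observes a transition from an input to itself.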

Given this interface, the ACP network will have $4nm$ parameters, called weights,
which can change during learning. For a given state i and an action j, there are
four weights associated with that state and action: $W_{ij+}$, $W_{ij-}$, $W_{0ij+}$, and $W_{0ij-}$.

Each weight marked "+" is a positive real number, and each weight marked "-"
is a negative real number. The sum $(W_{0ij+} + W_{0ij-})$ represents an estimate of the
expected total discounted reward received if action j is performed in state i followed
by optimal actions in all subsequent states. The other weights are constantly adjusted
so that the sum $(W_{ij+} + W_{ij-})$ will tend over time to become equal to $(W_{0ij+} + W_{0ij-})$.

The ACP network is guaranteed to solve Markov sequential decision problems if
six modifications are made to it:

1. The definition of y(t) in equation 2 should not include the R(t) term,

yielding:


y(0  =/  E  +  ^v^-)  (M) 


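2. $\Delta y(t)$ in equation 6 should be replaced with an expression including
R(t) and the discount factor $\gamma$, yielding a modified drive-reinforcement
learning mechanism:

$$\Delta W_{0ij\pm}(t) = \left[\gamma\, y(t) - y(t-1) + R(t)\right] \sum_{k=1}^{\tau} c_k \left|W_{0ij\pm}(t-k)\right| \left[\Delta x_i(t-k)\right]^{+}$$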


3. The weights should change at every time step instead of only at the end
of each trial. This change makes the exact timing of the calculations
critical. If a $\Delta W(t)$ is calculated that would change the index $j_{\max}(t)$,
then the weight should change and the calculations should be repeated
during that time step. Therefore the order of the calculations should be:

• Calculate each $y_i(t)$ for all i.

• Calculate $j_{\max}(t) = j$ such that $y_j(t)$ is maximum (the lowest such j in
case of tie).

• Calculate y(t).

• Calculate each $\Delta W_{ij\pm}(t)$ and $\Delta W_{0ij\pm}(t)$.

• Replace each $W_{ij\pm}(t)$ with $W_{ij\pm}(t) + \Delta W_{ij\pm}(t)$.

• Recalculate $j_{\max}(t)$ and y(t).

• Recalculate each $\Delta W(t)$.

• Calculate each $W(t+1)$.

4. The network should be able to operate in an exploratory mode and in a
controller mode. In controller mode, the index $j_{\max}(t)$ should be the index of
the highest $y_j(t)$. In exploratory mode, the index $j_{\max}(t)$ should be chosen
by some other mechanism. The only constraint on the choice of $j_{\max}$ is
that, if the system stays in exploratory mode, the system must eventually
try every action in every state infinitely often.

5. If the Markov process is nondeterministic, then the learning rate $c_1$ must
slowly decay to zero to ensure weight convergence. If $c_1$ is monotonically
nonincreasing, and if time $t_n$ is the first point in time such that
every action has been tried in every state at least n times prior to $t_n$, then
$c_1(t)$ should decrease at a rate that satisfies:

$$\lim_{n \to \infty} c_1(t_n) = 0 \tag{15}$$

$$\sum_{n=1}^{\infty} c_1(t_n) = \infty \tag{16}$$

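6. $|W|$ should be removed from equations 5 and 6, yielding:

$$\Delta W_{ij\pm}(t) = \left[c_a + c_b\,|y(t)|\right] \left[\Delta x_i(t)\right]^{+} \left[y(t) - y_j(t)\right] \tag{17}$$

$$\Delta W_{0ij\pm}(t) = \Delta y(t) \sum_{k=1}^{\tau} c_k \left[\Delta x_i(t-k)\right]^{+} \tag{18}$$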
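
As an illustration of the decay conditions in modification 5 (equations 15 and 16), one schedule that satisfies both is the harmonic rate $c_1(t_n) = 1/n$. This particular schedule is an assumption offered as an example; the text does not prescribe one.

```latex
% Harmonic schedule: the rate vanishes, but its sum still diverges.
c_1(t_n) = \frac{1}{n}:
\qquad
\lim_{n \to \infty} \frac{1}{n} = 0,
\qquad
\sum_{n=1}^{\infty} \frac{1}{n} = \infty .

% A faster decay fails the second condition, because its sum is finite:
c_1(t_n) = \frac{1}{n^{2}}:
\qquad
\sum_{n=1}^{\infty} \frac{1}{n^{2}} = \frac{\pi^{2}}{6} < \infty .
```

The first condition lets the weights settle despite noisy transitions; the second keeps the total amount of adjustment unbounded, so early estimates can always be corrected.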


Modifications 1 and 2 are important changes that affect the behavior of the network
in significant ways. They are essential to optimality and also yield improvements
in the behavior of the network, as described later. Modifications 3, 4, and 5 are the
obvious properties necessary in almost any learning system to solve infinite-horizon
Markov decision problems with unknown transition probabilities. Modification 6 is
assumed true in the following analysis. Instead of removing the $|W|$ factors from the
equations, they can simply be made to have an arbitrarily small effect. This is done
by initializing the weights to large values and using correspondingly small learning
constants. The weights then change by arbitrarily small percentages during learning
and so have arbitrarily small effects on the rate of learning. Although conditions 4
and 5 are needed for guaranteed optimality, they may not be necessary in practice.
Only modifications 1, 2, and 3 were used in the experimental results presented
later. Modification 4, exploration, was not found to be necessary for the cart-pole
problem we employed. Exploration mechanisms remain an area for future research.
Modification 5, decaying learning rates, is of theoretical importance but is generally
not implemented in reinforcement learning systems. Modification 6 simplifies the
network and aids in the analysis of the network but has one negative effect. During
classical (delay) conditioning of animals or of the theoretical learning system described
in Klopf (1988), the output of the system increases slowly at first, then more
rapidly, then slowly again. This S-shaped learning curve occurs when an association
is learned for the first time. If the response extinguishes and is then reacquired, learning
the reacquired response takes less time than learning the original response. This
property is due to the presence of the $|W|$ factor in the equations and because all
weights come in excitatory-inhibitory pairs. If the $|W|$ factor is removed from the
equation (or, equivalently, the initial values of the weights are large), then learning
is never S-shaped. For the modified network described here, the $|W|$ factor remains
in the equation for compatibility with previous conditioning results. Small initial
weights are used for the conditioning simulations, and large initial weights are used
for the cart-pole control problem.

It has been verified in simulation that an ACP network with all of these modifications
replicates the results in Klopf, Morgan, and Weaver (1993; see also the article,
"A Hierarchical Network of Control Systems that Learn"). These modifications,
therefore, do not adversely affect any of the network's demonstrated abilities.

Because of the interface with the Markov process previously described, any given
$x_i$ will never be nonzero for more than one consecutive time step. Therefore, if on
any given time step $x_i$ equals 1, then $\Delta x_i(t)$ and $[\Delta x_i(t)]^{+}$ will both equal 1. If $x_i$ is
0 on a given time step, then $\Delta x_i(t)$ will be 0 or -1 and $[\Delta x_i(t)]^{+}$ will be 0. Thus
$[\Delta x_i(t)]^{+}$ is always equal to $x_i(t)$, which is always 0 or 1. Using this fact and choosing
values  for  the  arbitrary  constants  so  that  r  =  1,  C,  =  1/2,  Cj  =  Cq  =  0  =  0,  the 
equations  describing  the  network  reduce  to  the  following: 


$$y_j(t) = \sum_{i=1}^{n} \left[W_{ij+}(t) + W_{ij-}(t)\right] x_i(t) \tag{19}$$

m 

yiO  =  E  ^.(0 

1=1 

(20) 

$$j_{\max} = \text{maximum } j \text{ such that } \forall k \;\; y_j(t) \ge y_k(t) \tag{21}$$

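$$\Delta W_{ij\pm}(t) = \begin{cases} \tfrac{1}{2}\left[y(t) - y_j(t)\right] & \text{if } x_i(t) \neq 0 \\ 0 & \text{otherwise} \end{cases} \tag{22}$$

$$\Delta W_{0ij\pm}(t) = \begin{cases} c_1\left(\gamma\, y(t) + R(t) - y(t-1)\right) & \text{if } x_i(t-1) \neq 0 \\ 0 & \text{otherwise} \end{cases} \tag{23}$$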


When the system is initialized, each motor center will give the same output as
the reinforcement center, PC. If excitatory weights are initialized greater than 1.0
and inhibitory weights are initialized less than -1.0, then the weights will never
change enough to reach zero, and equation 22 will always hold. Equation 22 ensures
that on every time step, the excitatory and inhibitory weights to the active motor
center will each change so as to eliminate half of the error in the motor center's
output. Therefore, the sum of those weights will change so that the output will
track perfectly the output of the reinforcement center. This ensures that the actions
selected are always the actions that maximize the output of the reinforcement center.
Equation 23 ensures that, when a given action is performed in a given state, the
weights associated with that state-action pair are updated to take into account the
reinforcement received for that action as well as the reinforcement center output on
the next time step.

Given these properties, the system reduces to Q-learning. While performing a
given action in a given state, the output of the reinforcement center can be interpreted
as the Q value for that state-action pair, which is the expected discounted reward
for performing that action followed by optimal actions thereafter. Equations 19
through 23 then ensure that the weights are changed in accordance with equation 10,
as described in Watkins (1989) and Watkins and Dayan (1992). Given sufficient exploration,
Watkins has proved that a Q-learning system will always converge to the
optimal solution. Therefore, for these parameter values and modifications, the modified
ACP network will also converge to the optimal solution. Thus, the preceding
analysis, together with Watkins's results, constitutes a proof of optimality for the
modified ACP network.
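
For readers who want the target of this reduction in concrete form, the following is a minimal sketch of the standard one-step tabular Q-learning update of Watkins (1989). The function name, the nested-list table layout, and the sample numbers are assumptions for illustration; the sketch shows the general rule rather than the network's weight equations.

```python
def q_update(q, state, action, reward, next_state, alpha, gamma):
    """One-step Q-learning: move Q(s, a) toward r + gamma * max_b Q(s', b)."""
    target = reward + gamma * max(q[next_state])
    q[state][action] += alpha * (target - q[state][action])
    return q[state][action]


# Hypothetical two-state, two-action table.
q_table = [[0.0, 0.0], [1.0, 2.0]]
new_value = q_update(q_table, state=0, action=1, reward=0.5,
                     next_state=1, alpha=0.5, gamma=0.9)
```

Here `alpha` plays the role of the learning rate $c_1$, and the bracketed error term corresponds to $\gamma\, y(t) + R(t) - y(t-1)$ in equation 23.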

It is interesting to note that if the Markov process is deterministic (each $\mathcal{P}^{i}_{jk}$ is
either zero or one), then the modified ACP network will find the optimal policy for
every state that it visits sufficiently often. There is no need to explore explicitly by
performing an action with a low Q value; it can simply perform the action with the
highest Q value at all times, and this will automatically result in sufficient exploration,
as can be shown by induction. If all Q values are initialized to optimistic values (representing
an incorrectly high estimation of expected reward), then during learning
no value will ever drop below its correct value. If, in a given state, a suboptimal action
happens to have the highest Q value, then repeatedly choosing that action will result
in it approaching its correct value asymptotically. It is therefore guaranteed to fall
eventually below the value of some other action, at which time that other action will
be performed. This implicit exploration is guaranteed to continue until all Q values
for suboptimal actions have fallen below the correct Q value for the optimal action.
The optimal action will be performed from then on. In this manner, the system
eventually will learn the optimal policy in every state that is visited sufficiently often,
even though the system always performs the action with the highest Q value.

Figure 4

Single-layer architecture. There is an excitatory and inhibitory weight from each state sensor to each
node. There is one node for each possible action. On each time step, the action is performed that is
associated with the node with the largest output. Positive, rewarding inputs (P) and negative, punishing
inputs (N) are combined to form a single, global reinforcement signal that drives learning. The only
weights that change during learning are the ones associated with recently performed actions. This model
is provably optimal and is capable of reproducing all of the demonstrated results of the unmodified,
two-layer ACP network.
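
The implicit-exploration argument for deterministic processes can be illustrated with a small simulation. The sketch below is an assumption-laden toy, not an experiment from the source: the three-state process, its rewards, the learning rate of 1, and the tie-breaking rule (lowest index) are all invented for illustration. Q values start at the optimistic upper bound 1.0, and the learner always acts greedily.

```python
def greedy_optimistic_q(next_state, reward, gamma, steps=500, start=0):
    """Always-greedy Q-learning on a deterministic process, optimistic start."""
    n_states = len(next_state)
    n_actions = len(next_state[0])
    q = [[1.0] * n_actions for _ in range(n_states)]  # optimistic initial values
    s = start
    for _ in range(steps):
        # Greedy choice; max() returns the lowest index among tied actions.
        a = max(range(n_actions), key=lambda j: q[s][j])
        s2 = next_state[s][a]
        # Learning rate 1 suffices because transitions are deterministic.
        q[s][a] = reward[s][a] + gamma * max(q[s2])
        s = s2
    policy = [max(range(n_actions), key=lambda j: q[i][j]) for i in range(n_states)]
    return q, policy


# Hypothetical process: NEXT_STATE[s][a] and REWARD[s][a], rewards in [0, 1 - gamma].
NEXT_STATE = [[1, 2], [0, 2], [0, 1]]
REWARD = [[0.0, 0.3], [0.5, 0.0], [0.0, 0.0]]
q_values, learned_policy = greedy_optimistic_q(NEXT_STATE, REWARD, gamma=0.5)
```

In this toy process each overestimated action is tried until its value falls below that of a competitor, so every action is eventually sampled without any explicit exploration step.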

6  Single-Layer  Model 

The preceding analysis of the modified two-layer ACP network suggests that it might
be possible to simplify the network without losing any of the desirable properties.
The network in Figure 4 consists of only a single layer of linear components, yet it
reproduces all the results of the modified two-layer network. The single-layer model
is not the same as either layer in the two-layer model; rather, it is equivalent to the
entire modified two-layer ACP network. Each of the mechanisms in the single-layer
network is also present in the two-layer network, such as mutual inhibition and global
training signals within a layer. The two models have different internal structure and
learning mechanisms but identical behavior. The modified two-layer ACP network
represents the minimum change necessary to the original ACP network to ensure
optimality. The single-layer network is much simpler and has identical behavior.
It represents the minimum change necessary to a Q-learning system to reproduce
the animal learning results. Some of the mechanisms in the two-layer network are
absent from the single-layer model, such as the presence of two different learning
mechanisms, facilitating connections, and some of the nonlinearities. On a given
time step, each node in the network computes a weighted sum of the inputs. Each
node is associated with an action. On a given time step, the action performed by
the system is the action associated with the node with the largest output. Learning
only occurs for those weights associated with recently performed actions. Learning
is driven by a single reinforcement signal, which is the sum of the reward signals
minus the sum of the punishment signals.

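Equations 24 through 27 define the operation of the single-layer, simplified
model:

$$y_j(t) = \sum_{i=1}^{n} \left[W_{ij+}(t) + W_{ij-}(t)\right] x_i(t) \tag{24}$$

$$\begin{aligned} j_{\max}(t) &= \text{maximum } j \text{ such that } \forall k \;\; y_j(t) \ge y_k(t) \\ &= \text{index of the action performed at time } t \end{aligned} \tag{25}$$

$$\Delta W_{ij\pm}(t) = \left[\gamma\, y_{j_{\max}(t)}(t) - y_{j_{\max}(t-1)}(t-1) + R(t)\right] \times \sum_{k=1}^{\tau} c_k \left|W_{ij\pm}(t-k)\right| \left[\Delta x_{ij}(t-k)\right]^{+} \tag{26}$$

$$\left[\Delta x_{ij}(t-k)\right]^{+} = \begin{cases} x_i(t-k) - x_i(t-k-1) & \text{if } x_i(t-k) - x_i(t-k-1) > 0 \text{ and } j = j_{\max}(t-k) \\ 0 & \text{otherwise} \end{cases} \tag{27}$$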


The only nonlinearity associated with the outputs is the process of finding the maximum
output. The only nonlinearity associated with the weights is the restriction
that the magnitude of a weight cannot fall below 0.1. This clipping of the weights
never occurs if the weights are initialized to sufficiently large values.

On time step t, the outputs $y_j(t)$ are calculated as linear combinations of the
inputs $x_i(t)$. The node with the largest output is found, and the index of that node
is labeled $j_{\max}(t)$. The action associated with that output is performed, leading to
a new state with inputs $x_i(t+1)$. Only weights associated with winning outputs
change, reflecting learning. If a given output is never the largest, then the action
associated with that output will never be performed, and the weights associated with
that output will not change. Exploration could be implemented by causing an output
to win the competition even though it is not the largest output.

There are two weights from each sensor to each node: $W_{ij+}$ and $W_{ij-}$. The
weight $W_{ij+}$ is always positive and $W_{ij-}$ is always negative. If $W_{ij+}$ falls below 0.1
during learning, then it is set equal to 0.1. If $W_{ij-}$ rises above -0.1, then it is set
equal to -0.1.
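
The output computation, winner selection, and weight clipping of the single-layer model can be sketched as follows. This is a minimal illustration under stated assumptions: the function names and the `w[i][j]` list layout (sensor i, node j) are invented here, ties are resolved toward the largest index following the "maximum j" wording of equation 25, and the learning rule of equation 26 is deliberately left out.

```python
def node_outputs(w_pos, w_neg, x):
    """Equation 24: y_j = sum_i (W_ij+ + W_ij-) * x_i, with w[i][j] indexing."""
    n_nodes = len(w_pos[0])
    return [
        sum((w_pos[i][j] + w_neg[i][j]) * x[i] for i in range(len(x)))
        for j in range(n_nodes)
    ]


def winning_index(y):
    """Equation 25: the largest index j whose output is maximal."""
    best = max(y)
    return max(j for j, value in enumerate(y) if value == best)


def clip_weights(w_pos, w_neg, floor=0.1):
    """Keep each weight's magnitude at or above `floor`, preserving its sign."""
    for row in w_pos:
        for j, w in enumerate(row):
            row[j] = max(w, floor)
    for row in w_neg:
        for j, w in enumerate(row):
            row[j] = min(w, -floor)
```

A learning step would compute `node_outputs`, pick `winning_index`, apply the weight changes for the winning node, and then call `clip_weights`.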

If the network is interfaced to a Markov sequential decision process, as described in
the preceding section, then each node corresponds to a possible action. Each sensor
(or pair of sensors) corresponds to a different state, and the sensors are binary, with
exactly one input equal to 1 at any given time. The sum $(W_{ij+} + W_{ij-})$ converges
to the expected discounted total return for performing action j in state i followed by
optimal actions thereafter. The difference $(W_{ij+} - W_{ij-})$ affects the speed of learning
for a given state-action pair, causing it to learn slowly at first and more quickly after
it has gained some experience. This yields the initial, positively accelerating portion
of the S-shaped learning curve observed in classical conditioning experiments.

The single-layer system reproduces all of the classical and instrumental conditioning
results achieved to date by the two-layer ACP network. By the same procedure
used in the discussion of two-layer networks, the one-layer system can be reduced
to Q-learning and is, therefore, also optimal. For the simulations of classical and
instrumental conditioning experiments, the sensor inputs to the network were binary
vectors that sometimes had multiple nonzero elements. For the optimality analysis
and for the cart-pole control experiments, the sensor input was a binary vector with
exactly one nonzero element at any given time.

The equations of both the original two-layer model and the single-layer model
imply that each layer acts as a winner-take-all network. There are, therefore, two
types of inputs to a given center: feedforward inputs coming from sensors and from
other layers, and lateral inputs coming from other centers in the same layer. The
weights associated with feedforward inputs are plastic and change during learning.
The connections associated with lateral inputs are hard-wired, with excitatory connections
from each center to itself and inhibitory connections from each center to
every other center in the same layer. On each time step, the centers first calculate
their outputs based on their feedforward inputs, then compete in a winner-take-all
fashion based on their lateral inputs. The largest output in the layer wins the competition
while all other outputs decay to zero. The winning output assumes the value
of the weighted sum of its feedforward inputs, whereas the losing outputs remain at
zero. These operations are repeated on each time step.

The reinforcement centers in the two-layer model learn in a manner similar to
the centers in the single-layer model. In the two-layer model, learning is driven by
the difference between the feedforward inputs at the end of one time step and the
feedforward inputs at the end of the previous time step. In the single-layer model,
learning is driven by the difference between the lateral inputs at the end of one time
step and the feedforward inputs at the end of the previous time step. Thus, both
models use the same structure within a layer, the same set of connections, the same
type of competition, and the same type of learning. The only differences are changes
as to which inputs drive learning for each center. The single-layer model does not
require any additional connections or additional flow of information beyond that
found in the two-layer model.

7  Conditioning  Results 

The ACP network described in Klopf, Morgan, and Weaver (1993) is capable of reproducing
a number of classical and instrumental conditioning experimental results.
The modified network described in the previous section retains this ability and is also
provably optimal. In addition, the modified network solves a problem arising with
the original network during simultaneous classical conditioning.

Sutton and Barto (1990) discuss the behavior of various models during simultaneous
classical conditioning. They point out that the original drive-reinforcement
model (Klopf, 1988) predicts the development of strong inhibition when a conditioned
stimulus (CS) occurs simultaneously with, or immediately following, an unconditioned
stimulus (US). This inhibition develops just as quickly as other forms
of conditioning and becomes strong enough to inhibit the unconditioned response
(UR) completely. For example, when food (a US) is placed in a dog's mouth, the
food will cause salivation (the UR). An initially neutral stimulus such as the sound
of a bell (a CS) has no effect on salivation. The original model predicts that if a bell
is rung simultaneously with, or just after, the placement of food in the mouth, then
after several repetitions, the dog will not salivate even when food is placed in the
mouth. If neutral stimuli were able to prevent URs in this manner, it is likely that
most animals would quickly lose their URs.

The modified two-layer ACP network and the single-layer network do not exhibit
this conditioned inhibition. If a brief US is present simultaneously with, or slightly
before, the onset of the CS, then no conditioning occurs. If the US is on for a long
time, then conditioned excitation can occur, which then allows the CS to elicit the
UR, even in the absence of the US. All of the results in Klopf (1988) and Klopf,
Morgan, and Weaver (1993; see also the article, "A Hierarchical Network of Control
Systems that Learn") have been reproduced using the single-layer network described
earlier, except that simultaneous and backward CS-US conditioning yielded no conditioning
or conditioned excitation instead of conditioned inhibition. In the case
of simultaneous conditioning, this represents an improvement in the ability of the
model to predict experimentally observed animal learning phenomena. In the case
of backward conditioning, as Klopf (1988) noted, animal learning experiments have
yielded both conditioned inhibition and conditioned excitation. The evaluation of
theoretical models for the case of backward conditioning remains a complex issue,
given the ambiguous experimental evidence.

8  A  Hierarchical  Network  Architecture 

The  single-layer  network  described  earlier  is  guaranteed  to  learn  the  correct  actions 
eventually.  This  learning  process  could  be  very  slow,  however,  so  it  may  be  useful 
to  look  at  extensions  of  the  architecture  to  speed  learning.  This  section  describes 
a  hierarchy  composed  of  several  of  these  layers.  The  hierarchy  is  first  described 
for  a  standard  control  problem,  and  then  the  application  of  it  to  other  problems  is 
discussed. 

A standard control problem is the cart-pole inverted-pendulum problem considered
in Michie and Chambers (1968) and in Barto, Sutton, and Anderson (1983).
A cart moves on a finite-length track. A pole is connected to the top of the cart
with a hinge. The goal is to balance the pole on the cart while the cart avoids the
ends of the track. The goal must be accomplished by applying a 10-newton force to
either the left or right side of the cart on each time step. The state of the system is
described by four variables:

$x$ : the position of the cart (center of the track is zero,
to the right is positive)

$\dot{x}$ : the velocity of the cart

$\theta$ : the angle of the pole from vertical (to the right is positive)

$\dot{\theta}$ : the angular velocity of the pole

If the pole exceeds 12 degrees from vertical, or if the cart exceeds 2.4 m from the
center of the track, that is defined to be a failure. The learning system is given no
indication of how it is performing until failure occurs. Then, it is informed that a
failure occurred but not whether the failure was due to the pole angle or to the cart
position. After a failure, the cart and pole are returned to the initial state, and the
controller is allowed to continue. An error in the controller's output may not result
in failure for many time steps. Therefore, this problem is substantially more difficult
than standard model-reference control problems in which performance information
is available on every time step, as in Morgan, Patterson, and Klopf (1990).
In Barto, Sutton, and Anderson (1983), each of the elements of the state of the
controlled system was divided into intervals having the following boundaries:

$x$ : ±0.8, ±2.4 m

$\dot{x}$ : ±0.5, ±∞ m/s

$\theta$ : 0, ±1, ±6, ±12°

$\dot{\theta}$ : ±50, ±∞ °/s

With these partitions, the state space is divided into 3 × 3 × 6 × 3 = 162 distinct
bins. The learning system has inputs encoding in which bin the cart-pole is but not
where the cart-pole is within the bin. This lack of information makes the control
problem more difficult, preventing the cart-pole system from being strictly a Markov
process. The learning system proposed by Barto, Sutton, and Anderson (1983) was
a network composed of an associative search element (ASE) and an adaptive critic element
(ACE). The ASE-ACE was given 162 binary inputs. At each point in time, the input
corresponding to the current state of the cart-pole was set to 1 and all other inputs
were set to 0. The learning system was given no a priori information about which
bins were adjacent.
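
The 162-bin partition can be sketched with a few lookups. This is an illustrative sketch rather than the original implementation: the function names, the zero-based bin numbering, the ordering of the variables within the index, and the treatment of values that fall exactly on a boundary are all assumptions. Angles are in degrees.

```python
from bisect import bisect_right

# Interior boundaries taken from the partition listed above.
X_EDGES = [-0.8, 0.8]                       # m
X_DOT_EDGES = [-0.5, 0.5]                   # m/s
THETA_EDGES = [-6.0, -1.0, 0.0, 1.0, 6.0]   # degrees
THETA_DOT_EDGES = [-50.0, 50.0]             # degrees/s


def box_index(x, x_dot, theta, theta_dot):
    """Return a bin number in 0..161, or None if the state is a failure."""
    if abs(x) > 2.4 or abs(theta) > 12.0:
        return None
    ix = bisect_right(X_EDGES, x)                     # 3 intervals
    ixd = bisect_right(X_DOT_EDGES, x_dot)            # 3 intervals
    ith = bisect_right(THETA_EDGES, theta)            # 6 intervals
    ithd = bisect_right(THETA_DOT_EDGES, theta_dot)   # 3 intervals
    return ((ix * 3 + ixd) * 6 + ith) * 3 + ithd


def one_hot(index, size=162):
    """Binary input vector with a single 1 at the current bin."""
    vec = [0] * size
    vec[index] = 1
    return vec
```

Because the bin number carries no notion of distance, neighboring bins look as unrelated to the learner as distant ones, which matches the remark that no adjacency information is supplied.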

Whereas the ASE-ACE used, in essence, a single number from 1 to 162 to represent
the state, the hierarchical system proposed here encodes the input in four separate
numbers, representing information about each of the four state variables. Each of
the state variables is associated with three bins, representing large positive values, large
negative values, and values near zero. The cart-position state variable uses only two
bins. The partitions between bins are at the same values as listed previously, except
there is no partition for $\theta = \pm 6°$, or for $x = -0.8$ m. The hierarchical network
is given more a priori information in that it is given information about each state
variable individually instead of having all of them encoded as a single number. It is
not clear how such information could be utilized by a single-layer network. The hierarchy
can therefore be thought of as a means to encode a priori information about
the number of state variables and the desired value of each state variable individually.
The hierarchy, with separate sensors for each variable, is shown in Figure 5.

Each variable has three intervals and two binary inputs associated with it. One
input is 1 when the variable is in its lowest-valued interval, and the other input is
1 when it is in its highest-valued interval. Both inputs go to zero when the variable
is  in  the  center  interval.  Cart  position,  x,  has  a  left  and  right  interval  but  not  a  center 
interval,  so  it  always  has  one  active  input.  Each  of  the  four  layers  in  the  hierarchy 
is  a  single-layer  controller,  or  ACP  network,  as  described  earlier.  Each  layer  has  two 
inputs,  corresponding  to  its  particular  state  variable,  and  two  outputs,  corresponding 
to  the  action  of  pushing  left  or  pushing  right  on  the  cart. 
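
The sensor encoding described above can be sketched as follows. This is only an illustration under stated assumptions: the partition values are taken from the trainer thresholds in the appendix (50°/s, 1°, 0.5 m/s, and a single split at 0.8 m for cart position), and the names `three_interval_inputs` and `encode_state` are hypothetical, not taken from the original simulation.

```python
# Assumed partition values, borrowed from the trainer rule in the appendix.
THETA_DOT_LIMIT = 50.0   # deg/s
THETA_LIMIT = 1.0        # deg
X_DOT_LIMIT = 0.5        # m/s
X_SPLIT = 0.8            # m; cart position has only a left and a right interval


def three_interval_inputs(value, limit):
    """Return (low, high) binary inputs; both are 0 in the center interval."""
    return (1 if value < -limit else 0, 1 if value > limit else 0)


def encode_state(x, x_dot, theta, theta_dot):
    """Map the four state variables to one pair of binary inputs per layer."""
    return {
        "theta_dot": three_interval_inputs(theta_dot, THETA_DOT_LIMIT),
        "theta": three_interval_inputs(theta, THETA_LIMIT),
        "x_dot": three_interval_inputs(x_dot, X_DOT_LIMIT),
        # Two intervals only, so exactly one input is always active.
        "x": (0, 1) if x > X_SPLIT else (1, 0),
    }
```

With all state variables at zero, every three-interval variable yields (0, 0) and cart position yields (1, 0), so eight inputs in total replace the 162 one-hot inputs of the ASE-ACE.
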

Figure 5

Hierarchical  network  architecture.  Each  of  the  four  horizontal  layers  shown  is  equivalent  to  the  network 
in  Figure  3.  Each  layer  receives  its  inputs  from  sensors  representing  a  different  state  variable  of  the 
cart-pole  system.  Only  one  layer  is  active  at  any  given  time,  and  the  output  of  that  layer  determines  the 
direction  of  the  force  applied  to  the  cart.  The  lower  layers  respond  to  more  rapidly  changing  state 
variables  and  are,  therefore,  given  higher  priority.  At  any  point  in  time,  the  active  layer  is  the  lowest  layer 
whose  state  variable  is  not  near  zero.  When  all  state  variables  are  near  zero,  the  top  layer  is  active. 


The  hierarchy  is  designed  so  that  exactly  one  layer  is  active  at  any  given  time. 
When a layer is active, its behavior is described by the equations given for the
single-layer system. When a layer is not active, it freezes completely. Therefore t − 1 in the
network equations does not represent the previous time step but rather the last time
step in which a given layer was active. If either of the θ̇ inputs is equal to 1, then
the bottom layer becomes active and forces the other three layers to be inactive. If
neither θ̇ input is 1, then the bottom layer becomes inactive and control can pass to
the θ layer. If either of the θ inputs is 1, then the θ layer becomes active and forces
the  two  layers  above  it  to  be  inactive,  and  so  forth.  The  output  of  the  active  layer  on 
a  given  time  step  determines  whether  the  controller  applies  force  to  the  left  or  right 
on  that  time  step.  At  failure,  all  of  the  layers  with  nonzero  inputs  become  active  so 
they  can  learn  from  the  failure. 
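
The arbitration rule just described can be sketched in a few lines. The layer order (angular velocity at the bottom, then angle, cart velocity, and cart position at the top) is an assumption inferred from the text, and the function names are hypothetical.

```python
# Assumed layer order, from the bottom (fastest variable) to the top (slowest).
LAYER_ORDER = ["theta_dot", "theta", "x_dot", "x"]


def active_layer(inputs):
    """Return the name of the single layer that controls the cart on this step.

    `inputs` maps each layer name to its (low, high) pair of binary inputs.
    The lowest layer with a nonzero input wins; otherwise the top layer is active.
    """
    for name in LAYER_ORDER:
        if any(inputs[name]):
            return name
    return LAYER_ORDER[-1]


def layers_learning_at_failure(inputs):
    """At failure, every layer with a nonzero input becomes active so it can learn."""
    return [name for name in LAYER_ORDER if any(inputs[name])]
```

Because a frozen layer keeps its state, the t − 1 terms in its equations refer to the last step on which `active_layer` selected it, not to the previous simulation step.
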

The  inputs  to  the  hierarchy  are  ordered  as  shown  in  Figure  5  and  are  ranked 
by  their  rate  of  change.  If  each  variable  is  divided  by  the  difference  between  its 
maximum  and  minimum  values,  then  it  is  possible  to  compare  the  speed  at  which 
the normalized variables change. In the cart-pole system, θ̇ changes more quickly
than any of the other variables. The parameter θ̇ varies from its lowest value to its
highest  value  in  a  fraction  of  a  second,  whereas  x  requires  many  seconds  to  go  from 
one  end  of  its  range  to  the  other.  Each  variable  serves  as  an  input  to  one  layer  of 
the  hierarchy.  The  fastest-changing  variable  is  connected  so  that  the  network  reacts 
immediately  when  it  reaches  extreme  values.  The  network  reacts  to  a  slower  variable 
only  when  all  of  the  faster  variables  are  safely  in  the  center  of  their  ranges.  Although 
the assignment of state variables to layers was done a priori in this experiment, it
would  be  possible  for  a  network  to  self-organize  so  that  lower  levels  connected  to 
faster-changing  inputs  and  higher  levels  connected  to  slower-changing  inputs. 
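
The ranking criterion can be illustrated with a short sketch. No self-organizing procedure is specified here, so the following only shows the normalization and comparison step; the function name and the choice of mean absolute change per sample as the speed measure are assumptions.

```python
def rank_by_rate_of_change(trajectories, ranges):
    """Order variable names from fastest- to slowest-changing.

    trajectories: dict name -> list of at least two samples at a fixed time step.
    ranges: dict name -> (minimum, maximum) used to normalize each variable.
    """
    rates = {}
    for name, samples in trajectories.items():
        low, high = ranges[name]
        span = high - low
        # Mean absolute change per step, as a fraction of the variable's range.
        steps = [abs(b - a) / span for a, b in zip(samples, samples[1:])]
        rates[name] = sum(steps) / len(steps)
    return sorted(rates, key=rates.get, reverse=True)
```

The first name in the returned list would be assigned to the bottom layer and the last name to the top layer.
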

This  hierarchical  ACP  network  is  provided  with  more  a  priori  information  than 
were the learning systems described in Michie and Chambers (1968) and Barto, Sutton,
and Anderson (1983). The hierarchical ACP network has a priori information
encoding which bins are adjacent and which bins are near zero. This makes the
problem somewhat easier. On the other hand, as in the case of Michie and Chambers
(1968) and Barto, Sutton, and Anderson (1983), the reinforcement signal comes
only  at  failure.  The  network  is  not  informed  whether  failure  was  due  to  the  pole 
angle  or  the  cart  position.  Thus,  this  test  bed  still  contains  the  difficult  temporal  and 
structural  credit  assignment  problems  inherent  in  the  original  problem  formulation. 
Success  with  this  problem  would  tend  to  indicate  the  usefulness  of  this  hierarchical 
architecture. 

The hierarchical ACP network can be viewed as a type of subsumption architecture
as proposed by Brooks (1986, 1991a, 1991b). In a behavior-based robot using
the subsumption architecture, the controller is divided into layers, each of which
runs in parallel and has direct access to sensors and actuators. Each layer is responsible
for a given behavior, such as obstacle avoidance or wall following, and the actions
generated by some layers are capable of modifying or overriding actions generated
by other layers. Mahadevan and Connell (1991) developed a three-level subsumption
architecture robot with modules for finding, pushing, and unwedging boxes in
a room. This system used Q-learning to learn to find and push the boxes across a
room. Lin (1991) developed a three-level system that used Q-learning to allow a
robot to follow walls, go through doors, and dock with a recharger. Singh (1992)
has developed a method for combining multiple simple behaviors, each of which
employs Q-learning. In each of these cases, it has been shown that the hierarchical
system can learn much faster than a single Q-learning system. The problem of balancing
a pole would generally be considered a single behavior and would typically
be handled by a single level of a subsumption architecture. The hierarchical ACP
network uses four layers for this problem, one for each of the four state variables.
Thus, by viewing as a behavior the problem of keeping a particular state variable near
zero, the hierarchical ACP network can be considered a fine-grained, behavior-based
subsumption system. Unlike some coarse-grained systems, this hierarchy could in
principle be self-organizing. This is possible because each layer has the same goal:
to keep its input near zero. The assignment of inputs to layers could be performed
automatically by assigning faster-changing inputs to lower levels and slower-changing
inputs to higher levels.

9  Test  Results 

Computer simulations were performed of the ASE-ACE controller of Barto, Sutton,
and Anderson (1983), the modified two-layer ACP network, and the hierarchical
network. Because the weight changes within the single-layer ACP network are identical
to the weight changes in the modified two-layer ACP network, separate results
for the single-layer network are not given. The parameters and equations for the
simulation are given in the appendix. In the ACP networks, initial weights were
biased slightly so that, in each state, the network would initially cause force to be
exerted  either  to  the  left  or  to  the  right.  The  action  to  be  given  the  larger  weight  was 
chosen  randomly,  so  the  behavior  of  the  system  on  the  first  trial  was  dependent  on 
the  seed  used  by  the  random  number  generator.  In  the  ASE-ACE  system,  all  weights 
are  initially  equal,  but  the  actions  are  chosen  nondeterministically,  so  that  system  too 
is  affected  by  the  random  number  generator  seed. 

The  cart-pole  system  was  simulated  at  50  Hz,  so  each  time  step  represented  0.02 
second  of  simulated  time.  As  in  Barto,  Sutton,  and  Anderson  (1983),  a  time  step  for 
the  controller  was  defined  as  the  period  that  the  system  was  in  a  single  bin,  so  the 
controller  was  constrained  to  apply  a  constant  force  while  in  a  bin.  Each  trial  started 
with  the  cart  stationary  in  the  center  of  the  track  and  the  pole  stationary  and  vertical 
(all  state  variables  set  to  zero).  The  trial  ended  in  failure  when  the  cart  position 
exceeded ±2.4 m or the pole angle exceeded ±12 degrees. The reinforcement signal
to  the  controller  was  a  constant  value  throughout  the  trial  and  then  dropped  to  a  lower 
value  at  failure.  The  goal  of  maximizing  reinforcement  was  therefore  equivalent  to 
the  goal  of  postponing  failure  for  as  long  as  possible.  A  controller  was  considered 
to  have  learned  successfully  if  a  trial  reached  80,000  time  steps  (26.7  minutes  of 
simulated  time)  without  failure.  If  a  controller  failed  to  learn  successfully  within 
100  trials,  it  was  considered  unsuccessful.  Each  controller  was  tested  ten  times,  with 
different  random  number  seeds.  Table  1  summarizes  the  percentage  of  the  ten  runs 
in  which  each  controller  successfully  learned,  as  well  as  the  average  time  to  learn. 
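
As a quick check of the protocol above, the failure test and the conversion from time steps to simulated minutes can be written out directly. The constants come from the text; the function names are hypothetical.

```python
TIME_STEP_S = 0.02        # 50 Hz simulation
SUCCESS_STEPS = 80_000    # steps without failure that count as successful learning
X_LIMIT_M = 2.4
THETA_LIMIT_DEG = 12.0


def has_failed(x_m, theta_deg):
    """A trial ends when the cart position or the pole angle exceeds its limit."""
    return abs(x_m) > X_LIMIT_M or abs(theta_deg) > THETA_LIMIT_DEG


def steps_to_minutes(steps, dt=TIME_STEP_S):
    """Convert a number of simulation time steps to minutes of simulated time."""
    return steps * dt / 60.0
```

Here `steps_to_minutes(SUCCESS_STEPS)` evaluates to about 26.7, matching the figure quoted for the success criterion.
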

In  ten  runs,  the  two-layer  ACP  network  was  successful  only  three  times.  It  often 
became  stuck  performing  a  suboptimal  policy.  This  problem  might  be  overcome  by 
adding an exploration mechanism but, instead, was addressed here by implementing
a hierarchical architecture. In those cases in which the network did learn, it learned
in a reasonable amount of time compared to the ASE-ACE.

Table 1  Comparison of learning reliability and speed

                                              Training Time
                              -------------------------------------------------
                              Average Number  Average Number  Average Simulated
Controller         % Success  of Trials       of Time Steps   Time (minutes)
-------------------------------------------------------------------------------
Two-layer ACP          30           77            75,000            25.0
ASE-ACE                80           70            96,511            32.2
Hierarchical ACP      100           71             4,016             1.3

ACP = associative control process; ASE-ACE = associative search element-adaptive critic element.

The  ASE-ACE  was  more  reliable,  successfully  learning  to  balance  the  pole  80 
percent  of  the  time.  (The  results  in  Barto,  Sutton,  &  Anderson  [1983]  seem  to 
indicate  that  the  system  learned  successfully  8  times  out  of  10,  in  approximately 
70  trials  on  average.)  The  time  to  learn  was  not  reported  by  Barto,  Sutton,  and 
Anderson,  so  we  simulated  the  ASE-ACE  to  obtain  those  values.  In  10  runs  of  our 
simulation,  the  controller  learned  7  times  out  of  10  and  required  50  trials  on  average. 
A  successful  run  required  an  average  of  96,000  time  steps  to  learn,  which  is  32 
minutes  of  simulated  time. 

The  hierarchical  ACP  network  was  the  most  reliable  network  for  this  particular 
problem.  It  always  learned  to  balance  the  pole  within  the  100-trial  limit.  It  required 
roughly  the  same  number  of  trials  as  the  other  two  controllers  but  required  less 
than  one-twentieth  of  the  simulated  time  for  training.  One  must  be  cautious  in 
generalizing based on results from a single simulated plant, but it does seem that
the hierarchical network is a promising approach, improving learning speed by more
than an order of magnitude and improving the reliability of learning for the cart-pole
problem. This architecture might be useful in other regulator problems, problems that
involve keeping state variables near a given value. This appears to be a fruitful area
for  future  research. 
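
The size of the speed improvement can be read off Table 1 with a short calculation. The step counts below are the values reported in the table; the variable names are hypothetical.

```python
# Average number of time steps to learn, as reported in Table 1.
steps = {"two_layer_acp": 75_000, "ase_ace": 96_511, "hierarchical_acp": 4_016}

# Ratio of each controller's training time to that of the hierarchical network.
speedup = {
    name: count / steps["hierarchical_acp"]
    for name, count in steps.items()
    if name != "hierarchical_acp"
}

for name, ratio in speedup.items():
    print(f"{name}: {ratio:.1f}x the training time of the hierarchical network")
```

The ratios come out near 18.7 for the two-layer ACP network and 24.0 for the ASE-ACE, which is the basis for describing the improvement as more than an order of magnitude.
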

One additional simulation was performed with the two-layer, 162-bin ACP network,
this time involving a supervised learning task. In the previous simulations, the
entire period that the system was in a given bin was treated as a single time step. For
the supervised learning task, each time step of the simulation was treated as a separate
time step by the network. Thus the network would experience multiple time steps
and would have multiple chances to change its weights and its output, even while it
was within a single bin. The output of the network was observed by a trainer, which
was preprogrammed with a known solution to the cart-pole problem. Whenever the
output of the network matched the desired output, the network was given a high
reinforcement signal. Whenever the output of the network was different from the
desired output, the network was given a low reinforcement signal. Not surprisingly,
when the initial, randomly determined action for a bin matched the desired output,
the network never tried any other action. When the initial action for a bin was
incorrect, it performed the incorrect action for one time step, then performed the
correct action from then on. In this supervised case, when the duration of a time
step was 0.002 second, the network succeeded in learning to balance the pole in
less than a single trial; that is, it learned without failure. This result demonstrates
that the reinforcement learning system described here is capable of utilizing detailed,
supervised training signals when they are available.

10  Conclusions 

The  original  ACP  network  described  in  Klopf,  Morgan,  and  Weaver  (1993;  see  also 
the article, “A Hierarchical Network of Control Systems that Learn” in this issue)
reproduces a variety of animal learning experimental results. The ACP network
modifications proposed here, including a modified drive-reinforcement learning
mechanism, simplify the system, improve the behavior for certain types of conditioning,
and  cause  the  system  to  be  provably  optimal,  while  retaining  the  ability  to  reproduce 
the  experimental  results.  Although  the  single-layer  network  is  guaranteed  eventually 
to  learn  to  control  any  Markov  decision  process,  the  simulation  results  suggest  that 
the  learning  speed  can  be  improved  through  the  use  of  a  hierarchical  architecture. 
The  hierarchical  architecture  that  we  have  proposed  and  tested  improves  the  learning 
speed  for  the  cart-pole  problem  by  more  than  an  order  of  magnitude  while  causing 
the  system  to  converge  to  the  correct  answer  more  reliably.  This  hierarchical  approach 
may  be  general  enough  to  apply  to  other  high-dimensional  regulator  problems. 

Appendix


The cart-pole equations were identical to the ones used in Barto, Sutton, and Anderson
(1983) and Baird and Baker (1990). The cart-pole test bed was simulated at
50  Hz  according  to  these  equations: 


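$$\ddot{x}_t = \frac{F_t + m l \left( \dot{\theta}_t^2 \sin\theta_t - \ddot{\theta}_t \cos\theta_t \right) - \mu_c \,\mathrm{sgn}(\dot{x}_t)}{m_c + m} \tag{28}$$

$$\ddot{\theta}_t = \frac{g \sin\theta_t + \cos\theta_t \left[ \dfrac{-F_t - m l \dot{\theta}_t^2 \sin\theta_t + \mu_c \,\mathrm{sgn}(\dot{x}_t)}{m_c + m} \right] - \dfrac{\mu_p \dot{\theta}_t}{m l}}{l \left[ \dfrac{4}{3} - \dfrac{m \cos^2\theta_t}{m_c + m} \right]} \tag{29}$$
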

The  plant  was  simulated  using  Euler’s  method  with  a  time  step  of  0.02  seconds. 
The parameters used were as follows:

g = 9.8 m/s² (acceleration due to gravity)

m_c = 1.0 kg (mass of cart)

m = 0.1 kg (mass of pole)

l = 0.5 m (half the length of the pole)

μ_c = 0.0005 (coefficient of friction of cart on track)

μ_p = 0.000002 (coefficient of friction of pole on cart)

F_t = ±10.0 newtons (force applied to cart’s center of mass at time t)
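
One Euler step of the plant can be sketched as follows. This is a minimal illustration that assumes the standard form of the cart-pole equations from Barto, Sutton, and Anderson (1983) with the parameter values listed above; angles are in radians here, and the names `euler_step` and `sgn` are hypothetical rather than taken from the original simulation code.

```python
import math

G = 9.8             # m/s^2, acceleration due to gravity
M_CART = 1.0        # kg
M_POLE = 0.1        # kg
L_HALF = 0.5        # m, half the length of the pole
MU_CART = 0.0005    # friction of cart on track
MU_POLE = 0.000002  # friction of pole on cart
DT = 0.02           # s, Euler time step (50 Hz)


def sgn(v):
    """Sign of v as -1, 0, or 1."""
    return (v > 0) - (v < 0)


def euler_step(state, force):
    """Advance (x, x_dot, theta, theta_dot) by one Euler step; theta in radians."""
    x, x_dot, theta, theta_dot = state
    total_mass = M_CART + M_POLE
    sin_t, cos_t = math.sin(theta), math.cos(theta)

    # Angular acceleration of the pole.
    inner = (-force - M_POLE * L_HALF * theta_dot**2 * sin_t
             + MU_CART * sgn(x_dot)) / total_mass
    numerator = (G * sin_t + cos_t * inner
                 - MU_POLE * theta_dot / (M_POLE * L_HALF))
    denominator = L_HALF * (4.0 / 3.0 - M_POLE * cos_t**2 / total_mass)
    theta_acc = numerator / denominator

    # Linear acceleration of the cart, which depends on theta_acc.
    x_acc = (force
             + M_POLE * L_HALF * (theta_dot**2 * sin_t - theta_acc * cos_t)
             - MU_CART * sgn(x_dot)) / total_mass

    # Explicit Euler update using the values at the start of the step.
    return (x + DT * x_dot, x_dot + DT * x_acc,
            theta + DT * theta_dot, theta_dot + DT * theta_acc)
```

Starting from rest with a force of +10 N, the first step leaves the positions unchanged and gives the cart a positive velocity and the pole a negative angular velocity, as expected when the cart is pushed out from under an upright pole.
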

The parameters for the modified two-layer ACP network were:

γ = 0.95, c_a = 0.249, c_y = 0.0, τ = 5,

c_1 = 0.033, c_2 = 0.030, c_3 = 0.027, c_4 = 0.024, c_5 = 0.021
reinforcement  signal  =  0.006  during  trial,  0.0  at  failure 
minimum  weight  magnitude  =  0.1 
initial  value  of  inhibitory  weight  =  —2.0 

initial  value  of  excitatory  weight  to  positive  reinforcement  center 
=  2.114 

initial  value  of  excitatory  weight  to  motor  centers  =  2.122  (for  biased 
action),  2.121  (for  unbiased  action) 

The  parameters  for  the  hierarchical  network  were: 


γ = 0.95, τ = 5, c_1 = 0.033, c_2 = 0.030, c_3 = 0.027, c_4 = 0.024,

c_5 = 0.021

reinforcement  signal  =  0.002  during  trial,  0.0  at  failure 
minimum  weight  magnitude  =  0.1 
initial  value  of  inhibitory  weight  =  —0.7 

initial  value  of  excitatory  weight  to  motor  centers  =  0.73  (for  biased 
action),  0.72999  (for  unbiased  action) 

For  the  supervised  learning  simulation,  the  trainer  calculated  the  desired  force,  F,  as 
follows: 


    if θ̇ > 50°/s then F = +10 N
    else if θ̇ < −50°/s then F = −10 N
    else if θ > 1° then F = +10 N
    else if θ < −1° then F = −10 N
    else if ẋ > 0.5 m/s then F = +10 N
    else if ẋ < −0.5 m/s then F = −10 N
    else if x > 0.8 m then F = +10 N
    else F = −10 N
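
The trainer rule above translates directly into code. The sketch below reads each positive branch of the printed rule as F = +10 N, which is an assumption. The high and low reinforcement values are placeholders, since the actual magnitudes used in the supervised simulation are not given here, and the function names are hypothetical.

```python
def desired_force(x, x_dot, theta_deg, theta_dot_deg):
    """Return the trainer's desired force in newtons (+10 or -10).

    Variables are tested from fastest-changing to slowest-changing.
    """
    if theta_dot_deg > 50.0:
        return 10.0
    if theta_dot_deg < -50.0:
        return -10.0
    if theta_deg > 1.0:
        return 10.0
    if theta_deg < -1.0:
        return -10.0
    if x_dot > 0.5:
        return 10.0
    if x_dot < -0.5:
        return -10.0
    if x > 0.8:
        return 10.0
    return -10.0


def supervised_reinforcement(network_force, state, high=1.0, low=0.0):
    """High signal when the network output matches the trainer, low otherwise.

    `high` and `low` are placeholder values, not the ones used in the paper.
    """
    return high if network_force == desired_force(*state) else low
```

The order of the tests mirrors the layer priorities of the hierarchical network: angular velocity first, then angle, cart velocity, and cart position.
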

The classical and instrumental conditioning results were reproduced with the single-layer
model. As in Klopf, Morgan, and Weaver (1993), the weight values and neuron
outputs were clipped to lie within the appropriate range for this simulation. If, during
learning, the magnitude of a weight went outside the range [0.1, 4], it was clipped to
lie on the border of that range. For the classical conditioning simulations, the output
was forced to remain positive. When the weighted sum of the inputs was negative,
the output was set to zero.
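
The two clipping rules can be sketched as follows. The sketch assumes that a weight keeps its sign when its magnitude is clipped, so that excitatory weights stay positive and inhibitory weights stay negative; the function names are hypothetical.

```python
import math

W_MIN, W_MAX = 0.1, 4.0  # allowed range for the magnitude of a weight


def clip_weight(w):
    """Clip the magnitude of w to [W_MIN, W_MAX], keeping its sign (assumed)."""
    magnitude = min(max(abs(w), W_MIN), W_MAX)
    return math.copysign(magnitude, w)


def classical_conditioning_output(weights, inputs):
    """Weighted sum of the inputs, set to zero when the sum is negative."""
    total = sum(w * x for w, x in zip(weights, inputs))
    return max(total, 0.0)
```

For example, `clip_weight(-0.01)` returns -0.1 and `clip_weight(5.0)` returns 4.0, while a weight already inside the range is returned unchanged.
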